Skip to content

fix(results): handle replay row schema aliases#1400

Merged
christso merged 3 commits into
mainfrom
av-results-row-schema-cleanup
Jun 17, 2026
Merged

fix(results): handle replay row schema aliases#1400
christso merged 3 commits into
mainfrom
av-results-row-schema-cleanup

Conversation

@christso

Copy link
Copy Markdown
Collaborator

Summary

Historical replay-shaped result rows now hydrate consistently across results and inspect: failed rows keep their testId, timing/token fields are preserved, and inspect filter --has-tool sees trace.toolCalls aliases.

The accepted results command input remains a run workspace or index.jsonl manifest. Canonical rows are still snake_case; this PR normalizes only the known historical camelCase aliases at the file boundary, then rejects eval-case-only JSONL with migration guidance instead of treating it as an unknown failed result.

Validation

  • bun test apps/cli/test/commands/results
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts
  • bun test packages/core/test/evaluation/trace-summary.test.ts
  • bun run typecheck
  • bun run lint
  • Push hook reran bun run typecheck and bun run lint successfully.

Red/Green UAT

Red, before the fix on the same replay-shaped row:

  • results summary returned failed_test_ids: ["unknown"], total_duration_ms: 0, and total_tokens: 0.
  • inspect filter --has-tool rg returned [].

Green, after the fix:

  • bun apps/cli/src/cli.ts results summary apps/cli/test/fixtures/results/camel-replay returns failed_test_ids: ["wtg-replay-fail"], total_duration_ms: 1234, and total_tokens: 15.
  • bun apps/cli/src/cli.ts inspect filter apps/cli/test/fixtures/results/camel-replay/index.jsonl --has-tool rg --format json returns the wtg-replay-fail row with tool_names: ["rg"].
  • Eval-case-only JSONL is rejected with: Unsupported result row ... Eval-case JSONL is input data, not a results artifact. Run agentv eval <eval-file> --output <run-dir>....

Post-Deploy Monitoring & Validation

No additional production monitoring required; this changes local CLI parsing of result artifacts only. After release, validate by running agentv results summary and agentv inspect filter --has-tool against a historical replay manifest if one is available. Watch CI and user reports for Unsupported result row, failed_test_ids containing unknown, or tool filters failing to match known trace summaries.


Compound Engineering
GPT-5

@christso

Copy link
Copy Markdown
Collaborator Author

Review Findings

P1 - Adjacent trace tests fail after the invalid-score error changed

apps/cli/src/commands/inspect/utils.ts:137 now routes standalone JSONL loading through normalizeResultRow(), so rows with a missing or non-numeric score throw Unsupported result row ... Expected an AgentV result row... instead of the prior Missing or invalid score... message. The existing trace utility contract at apps/cli/test/commands/trace/trace.test.ts:191 and apps/cli/test/commands/trace/trace.test.ts:198 still asserts the old message, and bun test apps/cli/test/commands/trace/trace.test.ts now fails those two tests. CI runs the full test suite, so this should block merge until the trace tests are updated to the new migration guidance or loadJsonlRecords() preserves the trace-specific invalid-score error.

Verification

  • bun install and bun run build were needed first because this fresh worktree lacked workspace package dist output.
  • bun test apps/cli/test/commands/results passed: 193 pass.
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts passed: 45 pass.
  • bun run typecheck passed.
  • bun apps/cli/src/cli.ts results summary apps/cli/test/fixtures/results/camel-replay returned failed_test_ids: ["wtg-replay-fail"], total_duration_ms: 1234, and total_tokens: 15.
  • bun apps/cli/src/cli.ts inspect filter apps/cli/test/fixtures/results/camel-replay/index.jsonl --has-tool rg --format json returned the replay row with tool_names: ["rg"].
  • Adjacent check bun test apps/cli/test/commands/trace/trace.test.ts failed: 47 pass, 2 fail, both invalid-score message expectations.

Note: gh pr review --request-changes was rejected by GitHub because the authenticated user cannot request changes on their own pull request, so I am leaving this as a blocking review comment.

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 17, 2026

Copy link
Copy Markdown

Deploying agentv with  Cloudflare Pages  Cloudflare Pages

Latest commit: 239ab03
Status: ✅  Deploy successful!
Preview URL: https://1c0dafa8.agentv.pages.dev
Branch Preview URL: https://av-results-row-schema-cleanu.agentv.pages.dev

View logs

@christso

Copy link
Copy Markdown
Collaborator Author

Addressed the blocking trace-test comment in c234fd04.

Decision: preserved the trace-specific invalid-score contract for standalone trace/result JSONL loaded through loadResultFile(). That path accepts trace exports, so Missing or invalid score... is more precise there than the result-artifact migration guidance. Eval-case-only rows still go through the shared result-row schema error and keep the migration guidance.

Validation:

  • Reproduced before fix: bun test apps/cli/test/commands/trace/trace.test.ts failed 2 invalid-score expectations.
  • bun test apps/cli/test/commands/trace/trace.test.ts — 49 pass.
  • bun test apps/cli/test/commands/results — 193 pass.
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts — 45 pass.
  • bun run typecheck — pass.
  • bun run lint — pass.

@christso

Copy link
Copy Markdown
Collaborator Author

CI Test job follow-up after c234fd0 / failed job https://github.com/EntityProcess/agentv/actions/runs/27683502887/job/81876531953

Root cause from the log:

  • apps/cli/test/commands/compare/compare.test.ts still expected the result/trace-specific Missing or invalid score contract, but invalid result-shaped rows were being rejected by the shared migration-guidance schema error before compare's own validation ran.
  • apps/cli/test/commands/results/shared.test.ts built the camel replay fixture path from process.cwd(), which becomes apps/cli under the CLI package test job and produced apps/cli/apps/cli/....

Fix in 239ab033:

  • normalizeResultRow now keeps eval-case-only rows on the migration-guidance error, while result-shaped rows with missing/non-finite score keep the existing invalid-score schema error.
  • Nested trace/output normalization no longer writes undefined fields onto rows, so eval-case rows do not get misclassified as result-like.
  • The camel replay fixture path is resolved relative to shared.test.ts, not cwd.

Validation:

  • bun test apps/cli/test/commands/compare/compare.test.ts apps/cli/test/commands/results/shared.test.ts — 55 pass
  • bun test apps/cli/test/commands/results — 193 pass
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts — 45 pass
  • bun test apps/cli/test/commands/trace/trace.test.ts — 49 pass
  • bun test apps/cli/test/commands/compare/compare.test.ts — 50 pass
  • bun run typecheck — pass
  • bun run lint — pass
  • git diff --check — pass

@christso

Copy link
Copy Markdown
Collaborator Author

Re-review result

No blockers remain for the prior results-schema feedback.

Verified the requested points:

  • Trace invalid-score expectations are fixed: bun test apps/cli/test/commands/trace/trace.test.ts passed, 49 tests.
  • Historical camelCase replay rows still normalize: results/shared.test.ts and artifact-writer.test.ts cover the fixture path, timing/token/cost/status aliases, and trace.toolCalls; an extra smoke confirmed canonical test_id/duration_ms values win when both snake_case and camelCase are present.
  • Compare/results fixture lookup no longer depends on process.cwd(): root-run tests passed, and bun test test/commands/results/shared.test.ts test/commands/compare/compare.test.ts also passed from apps/cli cwd, 55 tests.
  • Canonical snake_case result rows remain preferred, while eval-case-only JSONL still rejects with the migration guidance (Eval-case JSONL is input data...).

Local verification:

  • bun install and bun run build for this fresh worktree setup.
  • bun test apps/cli/test/commands/trace/trace.test.ts — 49 pass.
  • bun test apps/cli/test/commands/results — 193 pass.
  • bun test apps/cli/test/commands/compare/compare.test.ts — 50 pass.
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts — 45 pass.
  • bun test test/commands/results/shared.test.ts test/commands/compare/compare.test.ts from apps/cli — 55 pass.
  • bun run typecheck — pass.
  • git diff --check origin/main...HEAD — pass.

GitHub checks on the latest commit are green: Build, Test, Typecheck, Lint, Validate Evals, Validate Marketplace, Check Links, and Cloudflare Pages all pass.

Residual risk: this was a targeted re-review of the prior blockers rather than a full functional UAT pass. Given the local targeted coverage plus green CI, I do not see remaining blocking risk for PR #1400.

@christso christso marked this pull request as ready for review June 17, 2026 11:25
@christso christso merged commit 51bb245 into main Jun 17, 2026
8 checks passed
@christso christso deleted the av-results-row-schema-cleanup branch June 17, 2026 11:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant